Scientific Programming with Python
Aims
Considered Applications
Jupyter is installed as a Python module.
To install it locally with conda
More Kernels available
mamba install -y plotly_express
ipython is a turbo-charged interpreter on the command line
Jupyter Lab vs. Notebook
ipython notebook
Remarks
.ipynb files are JSON files
jupyter nbconvert --clear-output --to notebook --output=OUTPUT INPUT.ipynb
Advantages
Drawbacks
What is Polars?
mamba install -y polars
Polars vs. Pandas
In [1]: import polars as pl
...: iris_data = pl.read_csv("https://gist.githubusercontent.com/netj/8836201/raw/6f9306ad21398ea43cba4f7d537619d0e07d5ae3/iris.csv")
...: iris_data.head()
Out[1]:
shape: (5, 5)
┌──────────────┬─────────────┬──────────────┬─────────────┬─────────┐
│ sepal.length ┆ sepal.width ┆ petal.length ┆ petal.width ┆ variety │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ f64 ┆ str │
╞══════════════╪═════════════╪══════════════╪═════════════╪═════════╡
│ 5.1 ┆ 3.5 ┆ 1.4 ┆ 0.2 ┆ Setosa │
│ 4.9 ┆ 3.0 ┆ 1.4 ┆ 0.2 ┆ Setosa │
│ 4.7 ┆ 3.2 ┆ 1.3 ┆ 0.2 ┆ Setosa │
│ 4.6 ┆ 3.1 ┆ 1.5 ┆ 0.2 ┆ Setosa │
│ 5.0 ┆ 3.6 ┆ 1.4 ┆ 0.2 ┆ Setosa │
└──────────────┴─────────────┴──────────────┴─────────────┴─────────┘
In [2]: iris_data.write_csv("/tmp/out.tsv", separator="\t")
In [3]: with open("/tmp/out.tsv") as inputf:
...:     print(inputf.readline())
...:     print(inputf.readline())
...:
sepal.length sepal.width petal.length petal.width variety
5.1 3.5 1.4 0.2 Setosa
Original data:
date,away_team,away_points,home_team,home_points
"Fri, Apr 1, 2016",Philadelphia 76ers,91,Charlotte Hornets,100
"Fri, Apr 1, 2016",Dallas Mavericks,98,Detroit Pistons,89
"Fri, Apr 1, 2016",Brooklyn Nets,91,New York Knicks,105
"Fri, Apr 1, 2016",Cleveland Cavaliers,110,Atlanta Hawks,108
"Fri, Apr 1, 2016",Toronto Raptors,99,Memphis Grizzlies,95
Load and clean the data.
import polars as pl
games_pl = (
    pl.read_csv("nba.csv")
    .filter(~pl.all_horizontal(pl.all().is_null()))  # drop rows that are entirely null
    .with_columns(
        pl.col("date").str.strptime(pl.Date, "%a, %b %d, %Y"),
    )
    .sort("date")
    .with_row_index("game_id")
)
games_pl.head()
Output
┌─────────┬────────────┬─────────────────────┬─────────────┬───────────────────┬─────────────┐
│ game_id ┆ date ┆ away_team ┆ away_points ┆ home_team ┆ home_points │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ date ┆ str ┆ i64 ┆ str ┆ i64 │
╞═════════╪════════════╪═════════════════════╪═════════════╪═══════════════════╪═════════════╡
│ 0 ┆ 2016-04-01 ┆ Philadelphia 76ers ┆ 91 ┆ Charlotte Hornets ┆ 100 │
│ 1 ┆ 2016-04-01 ┆ Dallas Mavericks ┆ 98 ┆ Detroit Pistons ┆ 89 │
│ 2 ┆ 2016-04-01 ┆ Brooklyn Nets ┆ 91 ┆ New York Knicks ┆ 105 │
│ 3 ┆ 2016-04-01 ┆ Cleveland Cavaliers ┆ 110 ┆ Atlanta Hawks ┆ 108 │
│ 4 ┆ 2016-04-01 ┆ Toronto Raptors ┆ 99 ┆ Memphis Grizzlies ┆ 95 │
└─────────┴────────────┴─────────────────────┴─────────────┴───────────────────┴─────────────┘
pivot - “long” to “wide” format
melt - “wide” to “long” format
join - join two tables
But - why?!
scikit-learn
rpy2
Installing dependencies
mamba create -y -n polars-example \
polars pandas pyarrow
Polars to Pandas
Pandas to Polars
NumPy
SciPy
👉 Mostly low-level primitives; many other libraries build on top.
Saving single array to .npy format
Saving multiple arrays to .npz format (ZIP of .npy)
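The corresponding save calls might look like this (paths are placeholders, created here via tempfile):

```python
import os
import tempfile

import numpy as np

mat = np.array([[1, 2], [3, 4]])
vec = np.array([5, 6])

outdir = tempfile.mkdtemp()
# Single array -> .npy
np.save(os.path.join(outdir, "mat.npy"), mat)
# Several named arrays -> .npz (a ZIP of .npy files)
np.savez(os.path.join(outdir, "arrays.npz"), mat=mat, vec=vec)
```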
Loading single array from .npy format
In [1]: import numpy as np
...:
...: mat = np.load("mat.npy")
...: mat
Out[1]:
array([[1, 2],
[3, 4]])
Loading multiple arrays from .npz:
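A minimal sketch of loading an .npz archive (built in memory here so the example is self-contained; a real call would pass a filename):

```python
import io

import numpy as np

# Simulate an .npz file in memory
buf = io.BytesIO()
np.savez(buf, mat=np.array([[1, 2], [3, 4]]), vec=np.array([5, 6]))
buf.seek(0)

arrays = np.load(buf)       # behaves like a dict of arrays
print(arrays["mat"].shape)  # each entry is a regular ndarray
```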
shape attribute determines the dimension
Plain Python
In [1]: import timeit
...:
...: a = range(100_000_000)
...: b = range(100_000_000)
...:
...: timeit.timeit("[x + y for x, y in zip(a, b)]", globals=globals(), number=1)
...:
Out[1]: 8.240364263998345
With numpy
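For comparison, a vectorized NumPy version of the same element-wise sum (array size reduced here so the sketch runs quickly; absolute timings will vary by machine):

```python
import timeit

import numpy as np

# Vectorized element-wise sum; 1 million elements keeps the demo fast
# (the plain-Python version above used 100 million)
n = 1_000_000
a = np.arange(n)
b = np.arange(n)

t = timeit.timeit("a + b", globals=globals(), number=1)
print(f"numpy element-wise add of {n:,} elements: {t:.4f}s")
```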
In [1]: import numpy as np
...: from scipy import linalg
...:
...: A = np.array([[1,3,5],[2,5,1],[2,3,8]])
...: A
Out[1]:
array([[1, 3, 5],
[2, 5, 1],
[2, 3, 8]])
In [2]: linalg.inv(A)
Out[2]:
array([[-1.48, 0.36, 0.88],
[ 0.56, 0.08, -0.36],
[ 0.16, -0.12, 0.04]])
In [3]: A.dot(linalg.inv(A)) #double check
Out[3]:
array([[ 1.00000000e+00, -1.11022302e-16, 4.85722573e-17],
[ 3.05311332e-16, 1.00000000e+00, 7.63278329e-17],
[ 2.22044605e-16, -1.11022302e-16, 1.00000000e+00]])
Summary
Potential Exercises
average, quantile, etc.
rpy2
Why interface to R with Python?
statsmodels, linearmodels, …)
rpy2 Installation
mamba create -y -n rpy2-example \
python=3.10 rpy2 jupyterlab pandas \
pyarrow polars
conda activate rpy2-example
Maybe the following is enough:
subprocess.call
If so - do it!
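The subprocess route can be sketched as follows; the helper name and the R expression are illustrative, and the function degrades gracefully when R is not on the PATH:

```python
import shutil
import subprocess

def run_r(expr: str):
    """Evaluate an R expression via the Rscript CLI.

    Returns stdout as a string, or None if Rscript is not installed.
    """
    if shutil.which("Rscript") is None:
        return None
    result = subprocess.run(
        ["Rscript", "-e", expr],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

print(run_r("cat(pi)"))  # e.g. '3.141593', or None without R
```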
If you need more advanced features, stay tuned!
rpy2 (1)
Hello World Pi!
$ R --vanilla
> pi
[1] 3.141593
In Python:
rpy2 (2)
Student’s t-test
In [1]: import rpy2.robjects as robjects
...: from rpy2.robjects.packages import importr
...:
...: stats = importr('stats')
...:
...: group1 = robjects.FloatVector([23.5, 25.1, 28.3, 29.8, 30.5])
...: group2 = robjects.FloatVector([20.1, 22.5, 25.3, 27.9, 29.2])
...:
...: result = stats.t_test(group1, group2)
...:
...: print(result)
Welch Two Sample t-test
data: c(23.5, 25.1, 28.3, 29.8, 30.5) and c(20.1, 22.5, 25.3, 27.9, 29.2)
t = 1.1311, df = 7.656, p-value = 0.2922
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.573733 7.453733
sample estimates:
mean of x mean of y
27.44 25.00
In [2]: result
Out[2]:
<rpy2.robjects.vectors.ListVector object at 0x7f668c5796c0> [RTYPES.VECSXP]
R classes: ('htest',)
[FloatSexp..., FloatSexp..., FloatSexp..., FloatSexp..., ..., FloatSexp..., StrSexpVe..., StrSexpVe..., StrSexpVe...]
statistic: <class 'rpy2.robjects.vectors.FloatVector'>
<rpy2.robjects.vectors.FloatVector object at 0x7f668c6a3700> [RTYPES.REALSXP]
R classes: ('numeric',)
[1.131085]
parameter: <class 'rpy2.robjects.vectors.FloatVector'>
<rpy2.robjects.vectors.FloatVector object at 0x7f668c700b80> [RTYPES.REALSXP]
R classes: ('numeric',)
[7.656022]
p.value: <class 'rpy2.robjects.vectors.FloatVector'>
<rpy2.robjects.vectors.FloatVector object at 0x7f668c700f80> [RTYPES.REALSXP]
R classes: ('numeric',)
[0.292203]
conf.int: <class 'rpy2.robjects.vectors.FloatVector'>
<rpy2.robjects.vectors.FloatVector object at 0x7f668c57ad00> [RTYPES.REALSXP]
R classes: ('numeric',)
[-2.573733, 7.453733]
...
null.value: <class 'rpy2.robjects.vectors.FloatVector'>
<rpy2.robjects.vectors.FloatVector object at 0x7f668c579780> [RTYPES.REALSXP]
R classes: ('numeric',)
[2.157220]
stderr: <class 'rpy2.robjects.vectors.StrVector'>
<rpy2.robjects.vectors.StrVector object at 0x7f668c6c1100> [RTYPES.STRSXP]
R classes: ('character',)
['two.sided']
alternative: <class 'rpy2.robjects.vectors.StrVector'>
<rpy2.robjects.vectors.StrVector object at 0x7f668c549ac0> [RTYPES.STRSXP]
R classes: ('character',)
['Welch Two Sample t-test']
method: <class 'rpy2.robjects.vectors.StrVector'>
<rpy2.robjects.vectors.StrVector object at 0x7f668c5499c0> [RTYPES.STRSXP]
R classes: ('character',)
['c(23.5, 25.1, 28.3, 29.8, 30.5) and c(20.1, 22.5...]
rpy2 (3)
From Pandas to R
In [1]: import pandas as pd
...: import rpy2.robjects as ro
...: from rpy2.robjects.packages import importr
...: from rpy2.robjects import pandas2ri
...:
...: pd_df = pd.DataFrame({'int_values': [1,2,3],
...: 'str_values': ['abc', 'def', 'ghi']})
...:
...: with (ro.default_converter + pandas2ri.converter).context():
...:     r_from_pd_df = ro.conversion.get_conversion().py2rpy(pd_df)
In [2]: base = importr('base')
...:
...: with (ro.default_converter + pandas2ri.converter).context():
...:     df_summary = base.summary(pd_df)
...:     print(df_summary)
int_values str_values
Min. :1.0 Length:3
1st Qu.:1.5 Class :character
Median :2.0 Mode :character
Mean :2.0
3rd Qu.:2.5
Max. :3.0
rpy2 (4)
From R to Pandas
In [1]: import pandas as pd
...: import rpy2.robjects as ro
...: from rpy2.robjects.packages import importr
...: from rpy2.robjects import pandas2ri
...:
...: r_df = ro.DataFrame({'int_values': ro.IntVector([1,2,3]),
...: 'str_values': ro.StrVector(['abc', 'def', 'ghi'])})
In [2]: base = importr('base')
...:
...: with (ro.default_converter + pandas2ri.converter).context():
...:     df_summary = base.summary(r_df)
...:     print(df_summary)
int_values str_values
Min. :1.0 Length:3
1st Qu.:1.5 Class :character
Median :2.0 Mode :character
Mean :2.0
3rd Qu.:2.5
Max. :3.0
rpy2 (5)
Installing Packages
import rpy2.robjects.packages as rpackages
utils = rpackages.importr("utils")
utils.chooseCRANmirror(ind=1) # select the first mirror in the list
from rpy2.robjects.vectors import StrVector
packnames = ("ggplot2", "lattice", "lazyeval")
names_to_install = [x for x in packnames if not rpackages.isinstalled(x)]
if len(names_to_install) > 0:
    utils.install_packages(StrVector(names_to_install))
rpy2 (6)
Plotting with ggplot2
import numpy as np
import pandas as pd
import rpy2.robjects.packages as packages
import rpy2.robjects.lib.ggplot2 as ggplot2
import rpy2.robjects as ro
R = ro.r
datasets = packages.importr("datasets")
mtcars = packages.data(datasets).fetch("mtcars")["mtcars"]
gp = ggplot2.ggplot(mtcars)
pp = (gp
+ ggplot2.aes_string(x="wt", y="mpg")
+ ggplot2.geom_point(ggplot2.aes_string(colour="qsec"))
+ ggplot2.scale_colour_gradient(low="yellow", high="red")
+ ggplot2.geom_smooth(method="auto")
+ ggplot2.labs(title="mtcars", x="wt", y="mpg"))
pp.plot()
R("dev.copy(png,'/tmp/out.png')")
R("dev.off()")  # close the device so the PNG file is actually written
scikit-learn
scikit-learn is a popular open-source machine learning library for Python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
# Generate synthetic data
data, _ = make_blobs(n_samples=300, centers=4, random_state=42)
# Create K-Means model
kmeans = KMeans(n_clusters=4)
# Fit the model to the data
kmeans.fit(data)
# Predict clusters for each data point
labels = kmeans.labels_
# Get cluster centers
cluster_centers = kmeans.cluster_centers_
plt.scatter(data[:, 0], data[:, 1], c=labels, cmap='viridis')
plt.scatter(cluster_centers[:, 0], cluster_centers[:, 1], c='red', marker='x', s=200)
plt.title('K-Means Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Generate synthetic regression data
X, y = make_regression(n_samples=100, n_features=1, noise=10, random_state=42)
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create Linear Regression model
model = LinearRegression()
# Fit the model to the training data
model.fit(X_train, y_train)
# Make predictions on the test data
y_pred = model.predict(X_test)
# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
# Visualize the data and regression line
plt.scatter(X_test, y_test, label='Test Data')
plt.plot(X_test, y_pred, color='red', label='Regression Line')
plt.title(f'Linear Regression (MSE: {mse:.2f})')
plt.xlabel('Feature')
plt.ylabel('Target')
plt.legend()
plt.show()
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
# Generate synthetic classification data
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
n_redundant=0, n_clusters_per_class=1, random_state=42)
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create Logistic Regression model
model = LogisticRegression()
# Fit the model to the training data
model.fit(X_train, y_train)
# Make predictions on the test data
y_pred = model.predict(X_test)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
# Create a confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
# Visualize the data and decision boundary
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap='coolwarm', marker='o', label='True Class')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
# Create a mesh to plot decision boundary
x_min, x_max = X_test[:, 0].min() - 1, X_test[:, 0].max() + 1
y_min, y_max = X_test[:, 1].min() - 1, X_test[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01), np.arange(y_min, y_max, 0.01))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.4, cmap='coolwarm')
plt.title(f'Logistic Regression (Accuracy: {accuracy:.2f})')
plt.legend()
plt.show()
Create a new conda environment (this may take a while):
mamba create -y -n python-tf tensorflow-gpu
conda activate python-tf
Verify that it works in principle:
python --version
# OUTPUT: Python 3.9.10
python -c 'import tensorflow; print(tensorflow.__version__)'
# OUTPUT: 2.6.2
python -c 'import tensorflow as tf; print(tf.config.list_physical_devices())'
# OUTPUT: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
# OUTPUT: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (hpc-cpu-88): /proc/driver/nvidia/version does not exist
# OUTPUT: [PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]
Deep learning is all fun and games until you realize that it is slow on CPUs.
hpc-login-1 $ srun --mem=10g --partition=gpu --gres=gpu:tesla:1 --pty bash -i
hpc-gpu-1 $ conda activate python-tf
hpc-gpu-1 $ python -c 'import tensorflow as tf; print(tf.config.list_physical_devices())'
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
$ python
>>> import tensorflow as tf
>>> mnist = tf.keras.datasets.mnist
>>> (x_train, y_train), (x_test, y_test) = mnist.load_data()
>>> x_train, x_test = x_train / 255.0, x_test / 255.0
>>> model = tf.keras.models.Sequential([
... tf.keras.layers.Flatten(input_shape=(28, 28)),
... tf.keras.layers.Dense(128, activation='relu'),
... tf.keras.layers.Dropout(0.2),
... tf.keras.layers.Dense(10)
... ])
>>> predictions = model(x_train[:1]).numpy()
>>> predictions
array([[-0.50569224, 0.26386747, 0.43226188, 0.61226094, 0.09630793,
0.34400576, 0.9819117 , -0.3693726 , 0.5221357 , 0.3323232 ]],
dtype=float32)
>>> tf.nn.softmax(predictions).numpy()
array([[0.04234391, 0.09141268, 0.10817807, 0.12951255, 0.07731011,
0.09903987, 0.18743432, 0.04852816, 0.11835073, 0.09788957]],
dtype=float32)
>>> loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
>>> loss_fn(y_train[:1], predictions).numpy()
2.3122327
>>> model.compile(optimizer='adam',
... loss=loss_fn,
... metrics=['accuracy'])
>>> model.fit(x_train, y_train, epochs=5)
2022-03-09 17:53:47.237997: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
Epoch 1/5
1875/1875 [==============================] - 3s 1ms/step - loss: 0.2918 - accuracy: 0.9151
Epoch 2/5
1875/1875 [==============================] - 3s 1ms/step - loss: 0.1444 - accuracy: 0.9561
Epoch 3/5
1875/1875 [==============================] - 3s 1ms/step - loss: 0.1082 - accuracy: 0.9674
Epoch 4/5
1875/1875 [==============================] - 3s 1ms/step - loss: 0.0898 - accuracy: 0.9720
Epoch 5/5
1875/1875 [==============================] - 3s 1ms/step - loss: 0.0773 - accuracy: 0.9756
<keras.callbacks.History object at 0x154e81360190>
>>> model.evaluate(x_test, y_test, verbose=2)
313/313 - 0s - loss: 0.0713 - accuracy: 0.9785
[0.0713074803352356, 0.9785000085830688]
>>> probability_model = tf.keras.Sequential([
... model,
... tf.keras.layers.Softmax()
... ])
>>> probability_model(x_test[:5])
<tf.Tensor: shape=(5, 10), dtype=float32, numpy=
array([[1.2339272e-06, 6.5599060e-10, 1.0560590e-06, 5.9356184e-06,
5.3691075e-12, 1.4447859e-07, 5.4218874e-13, 9.9996936e-01,
1.0347234e-07, 2.2147648e-05],
[2.9887938e-06, 6.8461006e-05, 9.9991941e-01, 7.2003731e-06,
2.9751782e-13, 8.2818183e-08, 1.4307782e-06, 2.3203837e-13,
4.7433215e-07, 2.9504194e-14],
[1.8058477e-06, 9.9928612e-01, 7.8716243e-05, 3.9140195e-06,
3.0842333e-05, 9.4537208e-06, 2.2774333e-05, 4.5549971e-04,
1.1015874e-04, 6.9138093e-07],
[9.9978787e-01, 3.0206781e-08, 2.8528208e-05, 8.5581682e-08,
1.3851340e-07, 2.3634559e-06, 1.8480707e-05, 1.0153375e-04,
1.1583331e-07, 6.0887167e-05],
[6.4914235e-07, 2.5808356e-08, 1.8225538e-06, 2.3215563e-09,
9.9588013e-01, 4.6049720e-08, 3.8903639e-07, 2.9772724e-05,
4.3141077e-07, 4.0867776e-03]], dtype=float32)>
>>> exit()
tf_script.py
#!/usr/bin/env python
import tensorflow as tf
print("TensorFlow version:", tf.__version__)
print(tf.config.list_physical_devices())
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
model = tf.keras.models.Sequential([
tf.keras.layers.Flatten(input_shape=(28, 28)),
tf.keras.layers.Dense(128, activation='relu'),
tf.keras.layers.Dropout(0.2),
tf.keras.layers.Dense(10)
])
predictions = model(x_train[:1]).numpy()
print(predictions)
print(tf.nn.softmax(predictions).numpy())
# ... and so on ;-)
tf_job.sh
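The contents of tf_job.sh are not shown; a minimal Slurm batch script matching the srun call and the tf-out.txt log used below might look like this (partition, GPU type, memory, and environment name are assumptions mirroring the interactive session above):

```shell
#!/bin/bash
#SBATCH --job-name=tf-job
#SBATCH --partition=gpu
#SBATCH --gres=gpu:tesla:1
#SBATCH --mem=10g
#SBATCH --output=tf-out.txt

source ~/.bashrc
conda activate python-tf
python tf_script.py
```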
Submit by
sbatch tf_job.sh
Log output:
$ cat tf-out.txt
2022-03-09 18:05:54.628846: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-03-09 18:05:56.999848: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30988 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:18:00.0, compute capability: 7.0
TensorFlow version: 2.6.2
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
[[-0.07757086 0.04676083 0.9420195 -0.59902835 -0.26286742 -0.392514
0.3231195 -0.17169198 0.3480805 0.37013203]]
[[0.07963609 0.09017922 0.22075593 0.04727634 0.06616627 0.05812084
0.11888511 0.07248258 0.12188996 0.12460768]]
🫵 Where can you apply what you have learned in your PhD project?
… but that’s all for this session
Recap
scikit-learn